Goal

  • Deliverable D1/D2:
    • Release of a “URL-finding R package” for statistical web scraping
  • Main goals:
    • Minimal dependencies / requirements to get started
    • Good documentation and help should be included
    • Scraping jobs should be easily configurable
  • What do we already have?
    • internal R scripts used to scrape data for different projects at STAT
    • many desired features (e.g. parallelization) already implemented
    • but they depend heavily on local infrastructure

Statistical (Selective) Scraping

  • Multiple packages for web scraping exist

  • They make it easy to extract the page source or specific contents (e.g. tables) of given websites

  • The desired workflow is slightly different:

    1. Obtain targets
      • e.g. IDs from the Business Register
      • Information (e.g. contact information) that should be searched for
    2. Extract target URLs by performing a web search
      • Example: use the Google Search API
    3. Run a customizable scraper
      • must take URLs, parameters, and keywords into account
    4. Connect scraped information with target IDs
      • can be done directly or using a statistical model
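The four steps above could map onto a package API roughly like the following sketch. All function names and data are illustrative placeholders, not the final interface:

```r
# Hypothetical end-to-end workflow; every function here is a stub.

# 1. Obtain targets, e.g. IDs and search terms from a business register
targets <- data.frame(
  id      = c("F001", "F002"),
  keyword = c("ACME Ltd contact", "Foo GmbH impressum"),
  stringsAsFactors = FALSE
)

# 2. Extract candidate URLs by performing a web search (stubbed here;
#    a real implementation might call e.g. the Google Search API)
search_urls <- function(keyword) {
  paste0("https://example.com/", gsub("\\s+", "-", tolower(keyword)))
}
targets$url <- vapply(targets$keyword, search_urls, character(1))

# 3. Run a customizable scraper over the URLs (stubbed)
scrape <- function(url, params = list()) {
  list(url = url, content = paste("scraped content of", url))
}
results <- lapply(targets$url, scrape)

# 4. Connect the scraped information back to the target IDs
linked <- data.frame(
  id      = targets$id,
  url     = targets$url,
  content = vapply(results, `[[`, character(1), "content"),
  stringsAsFactors = FALSE
)
print(linked)
```

In a real package, steps 2 and 3 would talk to a Selenium instance; the stubs only illustrate how the outputs of each step feed the next.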

Development goals

  • Make use of existing frameworks/packages as much as possible
  • Web-scraping libraries often make use of Selenium
    • a framework to drive “headless” browsers using a well-defined API
    • typically used for testing purposes
    • customizable (e.g. different browsers with different versions)
  • Make use of popular, well-known R-packages (where it makes sense)
  • Provide helpful instructions and documentation for users
    • Man-Pages, Vignettes and/or a pkgdown-Page
  • Allow quick onboarding: a running Selenium instance is the only requirement
    • can easily be done in a dockerized setup
    • will also be documented
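As an illustration, a minimal dockerized Selenium setup might look like the following docker-compose sketch (the image tag and port mapping are the standard Selenium defaults; pin a specific tag in practice):

```yaml
# Minimal example: a standalone Firefox Selenium container.
# The scraper then connects to the WebDriver endpoint on port 4444.
services:
  selenium:
    image: selenium/standalone-firefox:latest  # pin a tag for reproducibility
    ports:
      - "4444:4444"  # WebDriver endpoint
    shm_size: "2g"   # browsers need a larger shared-memory segment
```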

Desired Features (1)

  • Customizability
    • each step can be customized
    • customization for each function should be harmonized by making use of specific packages (e.g config)
    • save/restore logic for parameters so that such objects can be restored for recurring projects
  • Parallelization
    • allow parallel scraping tasks by making use of the existing Selenium architecture (hub/nodes)
  • Reproducibility
    • Utility functions could be provided that create docker-compose files for specific node and/or browser versions using available tags
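The save/restore idea for parameters could be as simple as serializing a parameter list, sketched here with base R (a real implementation might use the config package instead; the parameter names are made up):

```r
# Hypothetical parameter object for a scraping job
params <- list(
  max_retries = 3,
  keywords    = c("contact", "impressum"),
  browser     = "firefox"
)

# Save the configuration so a recurring project can restore it later
path <- file.path(tempdir(), "scraper-params.rds")
saveRDS(params, path)

# ... in a later session ...
restored <- readRDS(path)
identical(restored, params)  # TRUE
```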

Desired Features (2)

  • Restart + Retry-Logic
    • Implement a logic that manages a pool of URLs to be scraped
    • Allow parameters that define how often URLs should be tried before giving up
    • Allow pause/restart operations if needed
  • Customizable outputs
    • Outputs could be written to plain files (rds, json)
    • or (file-based) Databases (duckdb)
    • Functionality will be implemented that allows reading/writing such outputs from/to all supported formats
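One possible shape for the retry logic, exercised here with a fake scraper that fails on its first two attempts (all names are placeholders for illustration):

```r
# Hypothetical retry wrapper: try each URL up to max_tries times,
# keeping NULL for URLs that were given up on.
scrape_with_retry <- function(urls, scraper, max_tries = 3) {
  results <- vector("list", length(urls))
  names(results) <- urls
  for (url in urls) {
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(scraper(url), error = function(e) NULL)
      if (!is.null(res)) {
        results[[url]] <- res
        break
      }
    }
  }
  results
}

# Fake scraper that fails twice before succeeding, to exercise the logic
attempts <- 0
flaky_scraper <- function(url) {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  paste("content of", url)
}

out <- scrape_with_retry("https://example.com", flaky_scraper, max_tries = 3)
```

A pool-based version with pause/restart would persist the remaining URLs and per-URL attempt counts (e.g. to one of the supported output formats) instead of keeping them in memory.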

Initial Work

  • Removed all STAT-related internal dependencies (packages) and settings from the existing R code and moved them to a separate (internal) package [done]
  • Create an R package following state-of-the-art best practices
    • Include unit-tests using testthat
    • Document everything using roxygen2
    • Write package vignettes and examples
    • Create a pkgdown-page [while coding]
    • Create a CI deployment pipeline (e.g. on GitHub?)

How to implement the Modelling Feature

  • Final step (linking scraped data to target IDs using a model) is the most difficult for generic implementation.
    • Option A: Aim to provide a generic, built-in model to assist users.
    • Option B: Let users choose their own models and provide documentation on potential methods.
  • Next Steps: We need to evaluate the feasibility of a universal solution versus the value of flexible guidance.
    • Preferences?
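To make Option A concrete, one generic building block could be fuzzy matching between scraped names and register entries. A minimal sketch using base R's adist() (edit distance); the data and column names are invented for illustration:

```r
# Hypothetical linking step: match each scraped name to the closest
# register entry by edit distance.
register <- data.frame(
  id   = c("F001", "F002"),
  name = c("ACME Limited", "Foo GmbH"),
  stringsAsFactors = FALSE
)
scraped_names <- c("Acme Ltd.", "FOO GmbH")

# Pairwise edit distances, case-insensitive
d <- adist(scraped_names, register$name, ignore.case = TRUE)

# Pick the closest register entry for every scraped name
best <- apply(d, 1, which.min)
linked <- data.frame(
  scraped = scraped_names,
  id      = register$id[best],
  stringsAsFactors = FALSE
)
print(linked)
```

A production model would add a distance threshold and probably richer features (address, legal form, language), but even this sketch shows why a single built-in model is hard to make universal.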

Other relevant choices?

  • Should defaults be provided? Some current values seem to be
    • country-specific (e.g. values for legal forms of enterprises) and/or
    • language-specific
  • What further features would be beneficial? How to prioritize?
  • What should/could be parametrized?
  • Idea:
    1. Write boilerplate functions and possible parameters
    2. Implementation phase
  • How could we get a discussion going? → GitHub?





Thanks for listening!

Questions?